A method of storing vector data in compressed form using clustering
Abstract
Recent advances in machine learning algorithms for information retrieval have made it possible to represent text and multimodal documents as vectors. These vector representations (embeddings) preserve the semantic content of documents and allow search to be performed by calculating the distance between vectors. Compressing embeddings reduces the memory they occupy and improves computational efficiency. The article reviews existing lossless and lossy methods for compressing vector representations. A method is proposed that reduces the error introduced by lossy compression by first clustering the vector representations. The method performs a preliminary clustering of the vector representations, stores the center of each cluster, and stores the coordinates of each vector representation relative to the center of its cluster. The cluster centers are then compressed losslessly, and the resulting shifted vector representations are compressed lossily. To restore an original vector representation, the coordinates of the center of its cluster are added to the coordinates of the shifted representation. The proposed method was tested on the fashion-mnist-784-euclidean and NYT-256-angular datasets. Vector representations compressed lossily by reducing the bit depth were compared with vector representations compressed using the proposed method. At the cost of a slight (about 10%) increase in the size of the compressed data, the absolute error caused by the loss of accuracy decreased by factors of four and two, respectively, on the two datasets. The developed method can be applied in tasks that require storing and processing vector representations of multimodal documents, for example, in the development of search engines.
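The steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain k-means for the clustering step and a cast to 16-bit floats as the bit-depth reduction, and the function names `kmeans`, `compress`, and `decompress` as well as the synthetic data are hypothetical choices made for illustration only.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every vector to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dist.argmin(axis=1)


def compress(x, k):
    """Store full-precision centers plus low-precision shifted vectors."""
    centers, labels = kmeans(x, k)
    # Coordinates relative to the cluster center, stored with reduced
    # bit depth (float16 here is an assumption, the lossy step).
    residuals = (x - centers[labels]).astype(np.float16)
    return centers.astype(np.float32), labels.astype(np.uint16), residuals


def decompress(centers, labels, residuals):
    """Add each cluster center back to its shifted vectors."""
    return centers[labels] + residuals.astype(np.float32)


if __name__ == "__main__":
    # Synthetic embeddings: a few well-separated groups far from the origin.
    rng = np.random.default_rng(1)
    blob_centers = rng.uniform(-100, 100, size=(4, 16))
    x = blob_centers[rng.integers(0, 4, size=1000)] + rng.normal(size=(1000, 16))
    x = x.astype(np.float32)

    restored = decompress(*compress(x, k=8))
    baseline = x.astype(np.float16).astype(np.float32)  # bit-depth reduction only

    print("mean abs error, bit-depth reduction only:", np.abs(baseline - x).mean())
    print("mean abs error, clustering + reduction:  ", np.abs(restored - x).mean())
```

The sketch relies on the fact that the absolute rounding error of a 16-bit float grows with the magnitude of the stored value, so coordinates measured from a nearby cluster center lose less precision than raw coordinates. The extra storage consists of the full-precision centers and one cluster index per vector.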